Data-centric serverless cloud architecture
Serverless has become a dominant cloud architecture thanks to its high scalability
and flexible, pay-as-you-go billing model. In serverless, developers compose their
cloud services as a set of functions while providers take responsibility for scaling each
function's resources according to traffic changes. Hence, the provider needs to promptly
spawn, or tear down, function instances (i.e., HTTP servers with user-provided handlers),
which cannot hold state across function invocations.
Performance of a modern serverless cloud is bound by data movement. Serverless
architecture separates compute resources and data management to allow function instances
to run on any node in a cloud datacenter. This flexibility, however, comes at the cost
of having to move function initialization state across the entire datacenter when
spawning new instances on demand. Furthermore, to facilitate scaling, cloud providers
restrict the serverless programming model to stateless functions, which cannot hold or
share state across different functions and lack efficient support for cross-function
communication.
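To make the stateless constraint concrete, below is a minimal AWS Lambda-style handler sketch; the bucket name, key schema, and event fields are illustrative assumptions, not part of the thesis. Because an instance can be torn down at any moment, each invocation must round-trip to external storage for any durable state:

```python
# Illustrative stateless handler (AWS Lambda-style signature). Globals may
# vanish whenever the provider tears the instance down, so durable state
# must live in an external storage service. Bucket/keys are hypothetical.
import json

import boto3  # assumed available in the function's runtime

s3 = boto3.client("s3")

def handler(event, context):
    # Every invocation fetches the state it needs from storage; it cannot
    # rely on state left behind by earlier invocations.
    obj = s3.get_object(Bucket="app-state", Key=event["user_id"])
    profile = json.loads(obj["Body"].read())
    profile["visits"] = profile.get("visits", 0) + 1
    s3.put_object(Bucket="app-state", Key=event["user_id"],
                  Body=json.dumps(profile))
    return {"statusCode": 200, "body": json.dumps(profile)}
```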
This thesis consists of the following four research contributions, which pave the way
for a data-centric serverless cloud architecture. First, we introduce STeLLAR, an
open-source serverless benchmarking framework that enables an accurate performance
characterization of serverless deployments. Using STeLLAR, we study three leading
serverless clouds and identify that all of them follow the same conceptual architecture
that comprises three essential subsystems, namely the worker fleet, the scheduler, and
the storage. Our analysis quantifies the aspect of the data movement problem that is
related to moving state from the storage to workers when spawning function instances
("cold-start" delays). We also study two state-of-the-art production methods of
cross-function communication, which involve either the storage subsystem or, when the
data is transmitted as part of invocation HTTP requests (i.e., inline), the scheduler
subsystem.
Second, we introduce vHive, an open-source ecosystem for serverless benchmarking
and experimentation, with the goal of enabling researchers to study and innovate across
the entire serverless stack. In contrast to incomplete academic prototypes and the
proprietary infrastructure of commercial clouds, vHive is representative of the leading
clouds and comprises only fully open-source, production-grade components, such as the
Kubernetes orchestrator and the AWS Firecracker hypervisor. To demonstrate vHive's
utility, we analyze cold-start delays, revealing that the high
cold-start latency of function instances is attributable to frequent page faults as the
function's state is brought from disk into guest memory one page at a time. Our analysis
further reveals that serverless functions operate over stable working sets, even across
function invocations.
Third, to reduce the cold-start delays of serverless functions, we introduce a novel
snapshotting mechanism that records and prefetches their memory working sets. This
mechanism, called REAP, is implemented in userspace and consists of two phases.
During the first invocation of a function, all accessed memory pages are recorded and
their contents are stored compactly as a part of the function snapshot. Starting from the
second cold invocation, the contents of the recorded pages are retrieved from storage
and installed in the guest memory before the new function instance starts to process the
invocation, avoiding the majority of page faults and thus significantly accelerating
the function's cold starts.
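The two phases can be pictured with a short Python sketch. This is a simplification under assumed file layouts and names (ws.meta, ws.pages); the real REAP operates in userspace alongside the Firecracker monitor and traps guest page faults rather than receiving a ready-made list:

```python
# Illustrative sketch of REAP's record/prefetch phases. File names and the
# fault-address list are hypothetical; real REAP intercepts guest page
# faults in userspace and works with Firecracker snapshots.

PAGE_SIZE = 4096

def record_phase(snapshot_file, fault_addresses):
    """First cold invocation: log every faulted page and store its
    contents compactly next to the snapshot."""
    working_set = sorted({a & ~(PAGE_SIZE - 1) for a in fault_addresses})
    with open(snapshot_file, "rb") as snap, \
         open("ws.pages", "wb") as pages, open("ws.meta", "w") as meta:
        for offset in working_set:
            snap.seek(offset)
            pages.write(snap.read(PAGE_SIZE))  # page contents, contiguous
            meta.write(f"{offset}\n")          # guest-memory offsets

def prefetch_phase(guest_memory):
    """Later cold invocations: install all recorded pages in one pass,
    before the instance begins processing, avoiding most page faults."""
    with open("ws.meta") as meta, open("ws.pages", "rb") as pages:
        for line in meta:
            offset = int(line)
            guest_memory[offset:offset + PAGE_SIZE] = pages.read(PAGE_SIZE)
```

In practice, guest_memory would be the memory-mapped guest address space, and the single sequential read of ws.pages is what replaces thousands of scattered one-page disk reads.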
Finally, to accelerate the cross-function data communication, we propose Expedited
Data Transfers (XDT), an API-preserving high-performance data communication
method for serverless. In production clouds, functions transmit intermediate data to other
functions either inline or through a third-party storage service. The former approach is
restricted to small transfer sizes, the latter supports arbitrary transfers but suffers from
performance and cost overheads. XDT enables direct function-to-function transfers
in a way that is fully compatible with the existing autoscaling infrastructure. With
XDT, a trusted component of the sender function buffers the payload in its memory
and sends a secure reference to the receiver, which is picked by the load balancer and
autoscaler based on the current load. Using the reference, the receiver instance pulls the
transmitted data directly from the sender's memory, obviating the need for intermediary
storage.
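A minimal sketch of the XDT handoff, assuming a shared secret between the trusted sender and receiver components; the function names, wire format, and in-process buffer stand in for the real cross-node transport:

```python
# Hypothetical sketch of an XDT-style transfer: only a small signed
# reference travels inline; the payload is pulled from the sender later.
import hashlib
import hmac
import os
import uuid

SECRET = os.urandom(32)   # shared only by trusted runtime components
buffers = {}              # sender-side payload buffers, keyed by reference

def send(invoke, payload: bytes):
    ref = str(uuid.uuid4())
    buffers[ref] = payload                       # buffer payload in memory
    tag = hmac.new(SECRET, ref.encode(), hashlib.sha256).hexdigest()
    # The inline request stays tiny, so the load balancer and autoscaler
    # can route it like any ordinary invocation.
    invoke({"xdt_ref": ref, "tag": tag})

def receive(request, pull_from_sender) -> bytes:
    tag = hmac.new(SECRET, request["xdt_ref"].encode(),
                   hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, request["tag"]):
        raise PermissionError("invalid XDT reference")
    # In the real design this is a direct network read from the sender's
    # memory; here an injected callable stands in for that transport.
    return pull_from_sender(request["xdt_ref"])
```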
Bankrupt Covert Channel: Turning Network Predictability into Vulnerability
Recent years have seen a surge in the number of data leaks despite aggressive
information-containment measures deployed by cloud providers. When attackers
acquire sensitive data in a secure cloud environment, covert communication
channels are a key tool to exfiltrate the data to the outside world. While the
bulk of prior work focused on covert channels within a single CPU, such channels
require the spy (transmitter) and the receiver to share the CPU, which might be
difficult to achieve in a cloud environment with hundreds or thousands of
machines.
This work presents Bankrupt, a high-rate highly clandestine channel that
enables covert communication between the spy and the receiver running on
different nodes in an RDMA network. In Bankrupt, the spy communicates with the
receiver by issuing RDMA network packets to a private memory region allocated
to it on a different machine (an intermediary). The receiver similarly
allocates a separate memory region on the same intermediary, also accessed via
RDMA. By steering RDMA packets to a specific set of remote memory addresses,
the spy causes deep queuing at one memory bank, which is the finest addressable
internal unit of main memory. This exposes a timing channel that the receiver
can listen on by issuing probe packets to addresses mapped to the same bank but
in its own private memory region. The Bankrupt channel delivers 74 Kb/s of throughput
in CloudLab's public cloud while remaining undetectable to existing monitoring
capabilities, such as CPU and NIC performance counters.

Comment: Published in WOOT 2020, co-located with USENIX Security 2020.
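The bank-contention channel reduces to a simple timed loop on each side. The sketch below, in Python for readability, treats the RDMA verbs as injected callables and uses made-up slot lengths and thresholds; a real implementation issues one-sided RDMA reads/writes and calibrates these constants per platform:

```python
# Illustrative Bankrupt-style encode/decode loops. BIT_SLOT and THRESHOLD
# are assumed constants; rdma_write/rdma_read stand in for RDMA verbs.
import time

BIT_SLOT = 1e-3     # time budget per transmitted bit (assumption)
THRESHOLD = 5e-6    # probe latency separating contended from idle bank

def transmit(bits, rdma_write, hot_addrs):
    for bit in bits:
        deadline = time.perf_counter() + BIT_SLOT
        while time.perf_counter() < deadline:
            if bit:                        # '1': flood addresses that map
                for addr in hot_addrs:     # to the target bank, creating
                    rdma_write(addr)       # deep queuing at that bank
            # '0': stay idle so the bank's queue drains

def receive(nbits, rdma_read, probe_addr):
    bits = []
    for _ in range(nbits):
        start = time.perf_counter()
        rdma_read(probe_addr)          # probe maps to the same bank, but
        latency = time.perf_counter() - start  # in the receiver's region
        bits.append(1 if latency > THRESHOLD else 0)
        time.sleep(BIT_SLOT)           # real code aligns to slot boundaries
    return bits
```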
Benchmarking, Analysis, and Optimization of Serverless Function Snapshots
Serverless computing has seen rapid adoption due to its high scalability and
flexible, pay-as-you-go billing model. In serverless, developers structure
their services as a collection of functions, sporadically invoked by various
events like clicks. High inter-arrival time variability of function invocations
motivates the providers to start new function instances upon each invocation,
leading to significant cold-start delays that degrade user experience. To
reduce cold-start latency, the industry has turned to snapshotting, whereby an
image of a fully-booted function is stored on disk, enabling a faster
invocation compared to booting a function from scratch.
This work introduces vHive, an open-source framework for serverless
experimentation with the goal of enabling researchers to study and innovate
across the entire serverless stack. Using vHive, we characterize a
state-of-the-art snapshot-based serverless infrastructure, built on the industry-leading
Containerd orchestration framework and the Firecracker hypervisor. We find that the
execution time of a function started from a
snapshot is 95% higher, on average, than when the same function is
memory-resident. We show that the high latency is attributable to frequent page
faults as the function's state is brought from disk into guest memory one page
at a time. Our analysis further reveals that functions access the same stable
working set of pages across different invocations of the same function. By
leveraging this insight, we build REAP, a light-weight software mechanism for
serverless hosts that records functions' stable working set of guest memory
pages and proactively prefetches it from disk into memory. Compared to baseline
snapshotting, REAP slashes the cold-start delays by 3.7x, on average.

Comment: To appear in ASPLOS 2021.
Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is
evidence that it will deliver the much-anticipated high density and access
latencies within only a few factors of DRAM. Nevertheless, the
latency-sensitive nature of memory-resident services makes seamless integration
of SCM in servers questionable. In this paper, we ask the question of how best
to introduce SCM for such servers to improve overall performance/cost over
existing DRAM-only architectures. We first show that even with the most
optimistic latency projections for SCM, the higher memory access latency
results in prohibitive performance degradation. However, we find that
deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the
performance of an SCM-mostly memory system competitive. The high degree of
spatial locality that memory-resident services exhibit not only simplifies the
DRAM cache's design as page-based, but also enables the amortization of
increased SCM access latencies and the mitigation of SCM's read/write latency
disparity.
We identify the set of memory hierarchy design parameters that plays a key
role in the performance and cost of a memory system combining an SCM technology
and a 3D stacked DRAM cache. We then introduce a methodology to drive
provisioning for each of these design parameters under a target
performance/cost goal. Finally, we use our methodology to derive concrete
results for specific SCM technologies. With PCM as a case study, we show that a
two bits/cell technology hits the performance/cost sweet spot, reducing the
memory subsystem cost by 40% while keeping performance within 3% of the best
performing DRAM-only system, whereas single-level and triple-level cell
organizations are impractical for use as memory replacements.

Comment: Published at MEMSYS'18.
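The flavor of the provisioning methodology can be conveyed with a toy cost/latency model; all latencies and per-GB costs below are illustrative assumptions, not numbers from the paper:

```python
# Toy model of the DRAM-cache + SCM design space the methodology explores.
# Unit costs and latencies are made up for illustration only.

def effective_latency_ns(hit_rate, dram_ns=100.0, scm_ns=300.0):
    """Average access latency with a DRAM cache in front of SCM."""
    return hit_rate * dram_ns + (1.0 - hit_rate) * scm_ns

def subsystem_cost(cache_gb, scm_gb, dram_per_gb=8.0, scm_per_gb=2.0):
    """Relative cost of a modest DRAM cache plus dense, cheap SCM."""
    return cache_gb * dram_per_gb + scm_gb * scm_per_gb

# A larger page-based cache raises the hit rate (high spatial locality),
# amortizing SCM's longer access latency at a modest cost increase.
for cache_gb, hit_rate in [(4, 0.80), (8, 0.90), (16, 0.95)]:
    print(f"cache={cache_gb}GB  "
          f"latency={effective_latency_ns(hit_rate):.0f}ns  "
          f"cost={subsystem_cost(cache_gb, scm_gb=128):.0f}")
```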
Expedited Data Transfers for Serverless Clouds
Serverless computing has emerged as a popular cloud deployment paradigm. In
serverless, the developers implement their application as a set of chained
functions that form a workflow in which functions invoke each other. The cloud
providers are responsible for automatically scaling the number of instances for
each function on demand and forwarding the requests in a workflow to the
appropriate function instance. Problematically, today's serverless clouds lack
efficient support for cross-function data transfers in a workflow, preventing
the efficient execution of data-intensive serverless applications. In
production clouds, functions transmit intermediate, i.e., ephemeral, data to
other functions either as part of invocation HTTP requests (i.e., inline) or
via third-party services, such as AWS S3 storage or AWS ElastiCache in-memory
cache. The former approach is restricted to small transfer sizes, while the
latter supports arbitrary transfers but suffers from performance and cost
overheads. This work introduces Expedited Data Transfers (XDT), an
API-preserving high-performance data communication method for serverless that
enables direct function-to-function transfers. With XDT, a trusted component of
the sender function buffers the payload in its memory and sends a secure
reference to the receiver, which is picked by the load balancer and autoscaler
based on the current load. Using the reference, the receiver instance pulls the
transmitted data directly from the sender's memory. XDT is natively compatible
with existing autoscaling infrastructure, preserves function invocation
semantics, is secure, and avoids the cost and performance overheads of using an
intermediate service for data transfers. We prototype our system in
vHive/Knative deployed on a cluster of AWS EC2 nodes, showing that XDT improves
latency, bandwidth, and cost over AWS S3 and ElastiCache.
Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling
To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance introduced by heavy skew in the popularity distribution of keys limits performance. To avoid violating tail-latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provide units of migration and replication for the purpose of load balancing. These techniques reduce the skew but incur additional monitoring, data replication, and consistency maintenance overheads.

In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data are aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack's micro-shards. We develop a queuing model to evaluate the impact of RackOut at datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes (RackOut_static) and another based on an adaptive load-balancing mechanism (RackOut_adaptive). Our results show that RackOut_static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail-latency service-level objectives. RackOut_adaptive improves throughput over RackOut_static by 30% for workloads with 20% writes.
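A sketch of the core routing idea, with hypothetical names: because reads are one-sided, any server in the rack can serve any micro-shard, so a random choice of server decouples key popularity from the load on a single owner:

```python
# Minimal sketch of RackOut-style routing versus a scale-out baseline.
# Node names, shard counts, and the read callables are hypothetical.
import hashlib
import random

RACK = ["node0", "node1", "node2", "node3"]   # servers pooling their memory
N_SHARDS = 1024                               # micro-shards across the rack

def shard_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SHARDS

def scale_out_read(key, local_read):
    """Baseline: only the shard's owner can serve it, so a popular key
    concentrates all of its load on one node."""
    owner = RACK[shard_of(key) % len(RACK)]
    return local_read(owner, shard_of(key))

def rackout_static_read(key, one_sided_read):
    """RackOut_static: any randomly chosen server in the rack serves the
    request, fetching the shard's data with a one-sided remote read."""
    server = random.choice(RACK)   # skew now spreads across the whole rack
    return one_sided_read(server, shard_of(key))
```

RackOut_adaptive would replace the random choice with one driven by per-node load measurements, which the abstract reports yields a further 30% throughput gain on write-heavy workloads.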